CLICK: Clustering Categorical Data using K-partite Maximal Cliques
Authors
Abstract
Clustering is one of the central data mining problems, and numerous approaches have been proposed in this field. However, few of these methods focus on categorical data. The categorical techniques that do exist have significant shortcomings in terms of performance, the clusters they detect, and their ability to locate clusters in subspaces. This work introduces a novel algorithm called CLICK, which finds clusters in categorical datasets based on a search method for k-partite maximal cliques. CLICK is able to detect subspace clusters and outperforms previous approaches by a factor of two to three. It scales better than any existing method on high-dimensional datasets. These results are demonstrated in a comprehensive performance study on synthetic and real datasets.
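The abstract does not say how the k-partite graph is built or searched, so the following is only a minimal sketch of the general idea, not the CLICK algorithm itself. It assumes that each (attribute, value) pair is a vertex, that two values of different attributes are linked when they co-occur in at least `min_support` records (a hypothetical threshold), and it uses the generic Bron-Kerbosch procedure with pivoting in place of the paper's own search method. The function names are illustrative.

```python
from collections import defaultdict
from itertools import combinations


def build_kpartite_graph(records, min_support):
    """Link values of different attributes that co-occur often enough.

    Vertices are (attribute index, value) pairs, so edges only ever join
    different attributes and the graph is k-partite by construction.
    The co-occurrence threshold is an assumption made for this sketch.
    """
    counts = defaultdict(int)
    for rec in records:
        nodes = list(enumerate(rec))
        for u, v in combinations(nodes, 2):
            counts[(u, v)] += 1

    adj = defaultdict(set)
    for (u, v), c in counts.items():
        if c >= min_support:
            adj[u].add(v)
            adj[v].add(u)
    return dict(adj)


def maximal_cliques(adj):
    """Enumerate maximal cliques with Bron-Kerbosch and pivoting."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(frozenset(r))
            return
        # Pick the pivot with the most neighbours among the candidates.
        pivot = max(p | x, key=lambda n: len(adj[n] & p))
        for v in list(p - adj[pivot]):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques


records = [
    ("red", "small", "round"),
    ("red", "small", "round"),
    ("blue", "large", "square"),
    ("blue", "large", "square"),
    ("red", "large", "round"),
]
cliques = maximal_cliques(build_kpartite_graph(records, min_support=2))
```

On this toy input, each maximal clique spans all three attributes and corresponds to one candidate cluster. A clique that covered only some of the attributes would correspond to a subspace cluster.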
Similar resources
k-Partite cliques of protein interactions: A novel subgraph topology for functional coherence analysis on PPI networks.
Many studies are aimed at identifying dense clusters/subgraphs from protein-protein interaction (PPI) networks for protein function prediction. However, the prediction performance based on the dense clusters is actually worse than a simple guilt-by-association method using neighbor counting ideas. This indicates that the local topological structures and properties of PPI networks are still open...
Clustering Numerical and Categorical Data
Clustering is an important technique for data mining which allows us to discover unknown relationships in our data sets. Clustering algorithms that use metrics based on the natural ordering of numbers cannot be applied to categorical (non-numerical) data. In this tutorial we will review the main methods for numerical data clustering (K-Means, Hierarchical Clustering and Fuzzy C-Means) and then s...
Finding All Maximal Cliques in Dynamic Graphs
Clustering applications dealing with perception based or biased data lead to models with non-disjunct clusters. There, objects to be clustered are allowed to belong to several clusters at the same time which results in a fuzzy clustering. It can be shown that this is equivalent to searching all maximal cliques in dynamic graphs like $G_t = (V, E_t)$, where $E_{t-1} \subset E_t$, $t = 1, \dots, T$; $E_0 = \emptyset$. In thi...
The Parallel Maximal Cliques Algorithm for Protein Sequence Clustering
Problem statement: Protein sequence clustering is a method used to discover relations between proteins. This method groups the proteins based on their common features. It is a core process in protein sequence classification. Graph theory has been used in protein sequence clustering as a means of partitioning the data into groups, where each group constitutes a cluster. Mohseni-Zadeh introduced ...
On finding k-cliques in k-partite graphs
In this paper, a branch-and-bound algorithm for finding all cliques of size k in a k-partite graph is proposed that improves upon the method of Grunert et al. (2002). The new algorithm uses bit-vectors, or bitsets, as the main data structure in bit-parallel operations. Bitsets enable a new form of data representation that improves branching and backtracking of the branch-and-bound procedure. Nume...